# Multimodal Contrastive Learning

| Model | Organization | License | Tags | Downloads | Likes | Description |
|---|---|---|---|---|---|---|
| PE Core B16 224 | facebook | Apache-2.0 | Text-to-Image | 9,663 | 11 | The Perception Encoder is a state-of-the-art image and video understanding encoder trained through simple vision-language learning, achieving top performance across a range of visual tasks. |
| PE Core G14 448 | facebook | Apache-2.0 | Text-to-Image | 22.83k | 14 | The Perception Encoder (PE) is a state-of-the-art image and video understanding encoder trained through simple vision-language learning, achieving top performance across a range of visual tasks. |
| PE Core L14 336 | facebook | Apache-2.0 | Text-to-Image | 11.52k | 34 | A large-scale visual encoder developed by Meta, reaching state-of-the-art performance on various vision tasks through contrastive pre-training and fine-tuning on synthetic video data. |
| Sail Clip Hendrix 10epochs | cringgaard | | Text-to-Image, Transformers | 49 | 0 | A vision-language model fine-tuned from openai/clip-vit-large-patch14 for 10 epochs. |
| Git RSCLIP | lcybuaa | Apache-2.0 | Text-to-Image, Safetensors | 59.37k | 4 | Git-RSCLIP is a vision-language model pretrained on the Git-10M dataset, specializing in multimodal understanding of remote sensing images. |
| Vit SO400M 14 SigLIP2 | timm | Apache-2.0 | Text-to-Image | 1,178 | 0 | A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification. |
| Eva02 Large Patch14 Clip 336.merged2b | timm | MIT | Text-to-Image | 197 | 0 | EVA02 CLIP is a large-scale vision-language model based on the CLIP architecture, supporting tasks such as zero-shot image classification. |
| Brahmai Clip V0.1 | brahmairesearch | MIT | Text-to-Image, Transformers, English | 12.53k | 0 | A CLIP model combining a ViT-L/14 image encoder with a masked self-attention Transformer text encoder, intended for zero-shot image classification research. |
| Resnet50x64 Clip.openai | timm | MIT | Image Classification | 622 | 0 | A CLIP model based on the ResNet50x64 architecture from the OpenCLIP library, supporting zero-shot image classification. |
| Fashion Embedder | McClain | MIT | Text-to-Image, Transformers, English | 58 | 0 | FashionCLIP is a CLIP-based vision-language model fine-tuned for the fashion domain, capable of producing general-purpose fashion product representations. |
| FLIP Base 32 | FLIP-dataset | Apache-2.0 | Multimodal Fusion, Transformers | 16 | 0 | A vision-language model based on the CLIP architecture, post-trained on 80 million face images. |
| Clip Vit Base Patch32 | Xenova | | Text-to-Image, Transformers | 177.13k | 8 | A CLIP model developed by OpenAI, based on the Vision Transformer architecture, supporting joint understanding of images and text. |
| CLIP ViT B 16 DataComp.L S1b B8k | laion | MIT | Text-to-Image | 1,166 | 1 | A zero-shot image classification model based on the CLIP architecture, trained on the DataComp dataset and supporting efficient image-text matching. |
| CLIP ViT B 16 CommonPool.L.clip S1b B8k | laion | MIT | Text-to-Image | 138 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 16 CommonPool.L.laion S1b B8k | laion | MIT | Text-to-Image | 106 | 0 | A CLIP-architecture vision-language model trained on the laion-s1B-b8K subset, supporting zero-shot image classification. |
| CLIP ViT B 16 CommonPool.L.text S1b B8k | laion | MIT | Text-to-Image | 58 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 16 CommonPool.L S1b B8k | laion | MIT | Text-to-Image | 517 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 DataComp.M S128m B4k | laion | MIT | Text-to-Image | 212 | 0 | A CLIP-architecture vision-language model trained on the DataComp.M dataset, supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M.laion S128m B4k | laion | MIT | Text-to-Image | 65 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M.image S128m B4k | laion | MIT | Text-to-Image | 73 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M.text S128m B4k | laion | MIT | Text-to-Image | 68 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M.basic S128m B4k | laion | MIT | Text-to-Image | 67 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M S128m B4k | laion | MIT | Text-to-Image | 79 | 0 | A zero-shot image classification model based on the CLIP architecture, supporting general vision-language tasks. |
| CLIP ViT B 32 DataComp.S S13m B4k | laion | MIT | Text-to-Image | 92 | 0 | A zero-shot image classification model based on the CLIP architecture, trained on the DataComp dataset. |
| CLIP ViT B 32 CommonPool.S.clip S13m B4k | laion | MIT | Text-to-Image | 68 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.S S13m B4k | laion | MIT | Text-to-Image | 79 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| Eva02 Enormous Patch14 Clip 224.laion2b S4b B115k | timm | MIT | Text-to-Image | 130 | 1 | A large-scale vision-language model based on the EVA02 architecture, supporting zero-shot image classification. |
| Align Base | kakaobrain | | Multimodal Alignment, Transformers, English | 78.28k | 25 | ALIGN is a dual-encoder vision-language model that aligns image and text representations through contrastive learning, achieving state-of-the-art cross-modal representations from large-scale noisy data. |
| Fashion Clip | patrickjohncyh | MIT | Text-to-Image, Transformers, English | 3.8M | 222 | FashionCLIP is a CLIP-based vision-language model fine-tuned for the fashion domain, capable of producing general-purpose product representations. |
| Altclip | BAAI | OpenRAIL | Text-to-Image, Transformers, Multilingual | 12.78k | 28 | AltCLIP is a simple and efficient bilingual CLIP model supporting Chinese-English text-image representation tasks. |
| Vit Base Patch16 Clip 224.openai | timm | Apache-2.0 | Text-to-Image, Transformers | 618.17k | 7 | CLIP is a vision-language model developed by OpenAI that trains image and text encoders contrastively, supporting zero-shot image classification. |
| Biomedvlp CXR BERT General | microsoft | MIT | Large Language Model, Transformers, English | 12.31k | 37 | CXR-BERT is a specialized language model for the chest X-ray domain, optimized for radiology text through an improved vocabulary and pretraining procedure. |
| Clip Rsicd | flax-community | | Text-to-Image | 146 | 4 | A remote-sensing model fine-tuned from OpenAI CLIP, improving zero-shot classification and image retrieval. |
| Clip Vit Base Patch16 | openai | | Image-to-Text | 4.6M | 119 | CLIP is a multimodal model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, enabling zero-shot image classification. |
| Clip Rsicd V2 | flax-community | | Text-to-Image | 3,229 | 23 | A remote-sensing model fine-tuned from OpenAI CLIP, improving zero-shot classification and cross-modal retrieval. |